Statistical Disclosure Control

Microdata: new masking methodology (WP1.1)

Leading partner: URV

Participating partners: Istat, UoP, StBa, CISC, URV

This workpackage is devoted to the research and development of new statistical disclosure control (SDC) methods to be included in the new release of the µ-ARGUS SDC package. Different approaches to microdata protection, most of them based on masking, will be explored and tested in the various tasks of this workpackage. A prototype implementation will be produced for each approach that can be taken as input by WP 2 (Microdata: software development) for integration into µ-ARGUS and by WP 5 (Methodology testing). The following are the objectives of the workpackage broken down by tasks:

Task T1 (responsible ISTAT and UoP)

Objectives

To build a new framework for statistical disclosure control for business microdata different from the usual framework based on matrix masking methodology. To design a matching algorithm to check the effectiveness of the proposed methodology. To improve µ-ARGUS by providing integratable implementations of the new methods defined in this task.

Description of the work

We propose to develop a methodology for statistical disclosure control for business microdata based on model estimates. A statistical model for quantitative variables that takes account of the geographical area to which each enterprise belongs will be built. We propose limiting disclosure of sensitive quantitative variables by releasing predictive intervals or other summaries of the predictive density associated with the model. The estimates of area effects from the model will suggest a broader categorisation to use when releasing the variable geographical area, chosen so as to keep information loss low. In order to check that the suggested protection measures are indeed sufficient, we will develop a specially designed matching algorithm. Software to implement the methodology proposed in this task will be developed using S-Plus or SAS, so that the statistical and graphical capabilities of those packages can be utilised.
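As an illustration only of what releasing predictive intervals could look like, the following minimal Python sketch assumes a very simple model that is not specified in this proposal: each value is an area effect plus a normal error with a common variance, estimated by the pooled within-area variance. The function name predictive_intervals, the one-way model and the use of a normal quantile instead of a Student t quantile are all assumptions made for the example; the model actually developed in this task may differ substantially.

```python
import math
from collections import defaultdict
from statistics import NormalDist


def predictive_intervals(records, level=0.95):
    """Return {area: (lower, upper)} for a list of (area, value) records.

    Assumed model (illustrative only): value = area effect + normal error
    with a common variance, estimated by the pooled within-area variance.
    A normal quantile is used as an approximation to the Student t quantile.
    """
    by_area = defaultdict(list)
    for area, value in records:
        by_area[area].append(value)

    # Area means and pooled within-area sum of squares.
    means = {}
    sum_squares = 0.0
    degrees_of_freedom = 0
    for area, values in by_area.items():
        mean = sum(values) / len(values)
        means[area] = mean
        sum_squares += sum((v - mean) ** 2 for v in values)
        degrees_of_freedom += len(values) - 1
    if degrees_of_freedom <= 0:
        raise ValueError("at least one area needs two or more records")

    sigma2 = sum_squares / degrees_of_freedom
    z = NormalDist().inv_cdf(0.5 + level / 2)

    intervals = {}
    for area, values in by_area.items():
        # Predictive variance = error variance + variance of the area mean.
        half_width = z * math.sqrt(sigma2 * (1 + 1 / len(values)))
        intervals[area] = (means[area] - half_width, means[area] + half_width)
    return intervals


records = [("north", 10.0), ("north", 12.0), ("north", 14.0),
           ("south", 100.0), ("south", 104.0)]
released = predictive_intervals(records)
```

In this sketch the interval for an area is released in place of the individual values; areas with fewer enterprises get wider intervals, which is the kind of evidence that could motivate merging them into a broader geographical category.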

Milestones and expected result:
A new methodology for Statistical Disclosure Limitation of business microdata

Task T2 (responsible StBa)

Objectives

The aim of this task is to improve the masking algorithm designed by Sullivan (see the references following the Description of the work below) in order to make it applicable in practice. Extensions to the algorithm will have to be implemented that yield useful results for complex data structures. Furthermore, it is necessary to develop a modified algorithm that can be used for partial masks. The former is necessary for combining masking with other SDC techniques, and the latter allows dependencies, especially filters in questionnaires, to be reflected.

Description of the work

In a first step, the structure of Sullivan's masking algorithm (Sullivan 1989, Fuller 1993) will be extended in order to integrate partial masks and fixed sets of values. Partial masking can be integrated by imposing linear restrictions during the masking and the iterative correction procedures. Fixed dependencies will be incorporated in a similar way, using other kinds of linear restrictions. For instance, we intend to test whether masking is effective (with respect to an internal distance criterion) if the values of the 'dependent' variables are fixed, because this strategy would reduce the computing time. The second step is the development of a revised data set from some business statistics (e.g. VAT statistics, cost-structure statistics) for a first application of the algorithm. This will be derived from the experience of statistical experts who have been working with the data. For example, variables that are rarely used should be excluded and rarely occurring categories should be collapsed. This procedure can be viewed as initial work toward the development of a scientific-use file. The third step is to apply the masking algorithm to the revised data set. Analyses with standard techniques are only valid if the whole masked (sub-)sample is included or if some separately masked sub-samples are analysed together. That is why well-defined subsamples should be masked.
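The idea of a partial mask can be illustrated with the following minimal Python sketch. It is not Sullivan's algorithm: it uses plain independent additive noise, with no transformation or iterative correction step, and only shows how a mask could be restricted to selected variables while values that are not applicable because of a questionnaire filter are left untouched. The function name partial_noise_mask, the noise_ratio parameter and the example variables are hypothetical.

```python
import random
from statistics import pstdev


def partial_noise_mask(rows, mask_columns, noise_ratio=0.1, seed=0):
    """Add normal noise to the selected variables only.

    rows: list of dicts mapping variable name -> number or None.
    None stands for "not applicable" (e.g. skipped by a questionnaire
    filter) and is kept as None. Variables outside mask_columns are
    copied unchanged. The noise standard deviation is noise_ratio times
    the standard deviation of the observed values of each variable.
    """
    rng = random.Random(seed)

    # Noise scale per masked variable, computed from observed values only.
    scale = {}
    for col in mask_columns:
        observed = [r[col] for r in rows if r[col] is not None]
        scale[col] = noise_ratio * pstdev(observed) if len(observed) > 1 else 0.0

    masked = []
    for r in rows:
        new_row = dict(r)  # copy, so the original record is not modified
        for col in mask_columns:
            if r[col] is not None:
                new_row[col] = r[col] + rng.gauss(0.0, scale[col])
        masked.append(new_row)
    return masked


rows = [
    {"turnover": 100.0, "exports": None, "employees": 5},
    {"turnover": 250.0, "exports": 40.0, "employees": 12},
    {"turnover": 400.0, "exports": 90.0, "employees": 30},
]
masked = partial_noise_mask(rows, ["turnover", "exports"])
```

Here the first enterprise reported no exports, so that value stays missing after masking, and the variable employees is outside the mask and therefore released unchanged.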

References
Fuller, W. A. (1993) Masking procedures for microdata disclosure limitation, Journal of Official Statistics, 9, 383-406.
Sullivan, G. R. (1989) The Use of Added Error to Avoid Disclosure in Microdata Releases, unpublished Ph. D. Thesis, Iowa State University.

Milestones and expected results
Modifications to Sullivan's algorithm concerning partial masks and fixed sets of variables.
Applicability tests of the new algorithm.
It is expected that the new algorithm can be applied to relatively complex data structures and that partial masks can be successfully performed for small and medium sized companies.

Task T3 (responsible URV)

Objectives

Microaggregation is the most used technique for protecting quantitative microdata. The main objective of this task is to move beyond the current state of the art of microaggregation methods by developing advanced algorithms, namely data-oriented microaggregation and microaggregation of unprojected multidimensional data. A second objective is to provide C/C++ implementations of the new algorithms that can be included in µ-ARGUS. A third objective is to characterise the computational complexity of exact optimal microaggregation, which will provide a theoretical justification for the use of heuristic methods. A final objective is to compare the performance of the new algorithms developed in this task against those developed in T2.

Description of the work

This task will move beyond the current state of the art in microaggregation, which includes individual ranking, single-axis methods, weighted moving averages and multivariate methods (Defays and Nanopoulos, 1993; Mateo-Sanz and Domingo-Ferrer, 1999). In all cases, both fixed-size and variable-size groups can be considered and, for each approach, there is a trade-off between information loss (data utility) and confidentiality protection (data safety). Current users of microaggregation, such as Eurostat and others (e.g. Corsini et al., 1999; Nechaeva and Sokolov, 1996), will benefit from the output of this task, whose main aim is to develop new algorithms for advanced microaggregation. Efficient algorithms are known for single-axis, individual ranking and weighted moving average methods as long as the group size is kept fixed. The following subtasks will be tackled: a) Development of new algorithms for the variable group size versions of known microaggregation approaches; b) Development of new algorithms for multivariate microaggregation of unprojected data. Programs implementing all the microaggregation algorithms developed will be written. Another important issue that will be dealt with is the justification of the use of heuristics for microaggregation. The idea is to characterise the computational complexity of exact optimal microaggregation, which is conjectured to be NP-hard. Microaggregation is a special case of record masking. Therefore, this task would not be complete without a comparison with other masking methods that will be included in µ-ARGUS (e.g. methods developed under task T2 of this workpackage). The comparison will be in terms of information loss and safety achieved. Finally, we expect to disseminate the results obtained in a number of high-quality scientific publications.
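For readers unfamiliar with the technique, the fixed-size univariate case mentioned above can be sketched in a few lines of Python. This is a minimal illustration of the existing baseline, not one of the new algorithms to be developed in this task: values are sorted, grouped into runs of k consecutive values, and each value is replaced by its group mean. The function names, the rule that the last group absorbs any remainder, and the SSE/SST ratio used as an information loss measure are assumptions chosen for the example.

```python
def microaggregate(values, k):
    """Fixed-size univariate microaggregation.

    Sort the values, form groups of k consecutive values and replace each
    value by the mean of its group. The last group absorbs the remainder,
    so it holds between k and 2k - 1 values.
    """
    n = len(values)
    if k < 1 or n < k:
        raise ValueError("need k >= 1 and at least k values")

    order = sorted(range(n), key=lambda i: values[i])
    n_groups = n // k
    result = [0.0] * n
    for g in range(n_groups):
        start = g * k
        end = n if g == n_groups - 1 else start + k
        members = order[start:end]
        mean = sum(values[i] for i in members) / len(members)
        for i in members:
            result[i] = mean
    return result


def information_loss(original, masked):
    """Within-group sum of squares divided by the total sum of squares.

    0 means no information lost; 1 means all variability was removed.
    """
    overall_mean = sum(original) / len(original)
    sst = sum((x - overall_mean) ** 2 for x in original)
    sse = sum((x - y) ** 2 for x, y in zip(original, masked))
    return sse / sst if sst > 0 else 0.0


values = [5.0, 1.0, 9.0, 2.0, 8.0, 3.0, 20.0]
masked = microaggregate(values, 3)
loss = information_loss(values, masked)
```

With k = 3 the seven values form the groups {1, 2, 3} and {5, 8, 9, 20}; the outlier 20 inflates the second group's loss, which is the kind of situation that variable-size (data-oriented) grouping is intended to handle better.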

References

Corsini, V., Franconi, L., Pagliuca, D., and Seri, G., (1999) An application of microaggregation methods to Italian business surveys. In: Statistical Data Protection'98, Luxembourg: OPOCE, pp. 109-113.
Defays, D., and Nanopoulos, P., (1993) Panels of enterprises and confidentiality: the small aggregates method. In: Proceedings of the 92 Symposium on Design and Analysis of Longitudinal Surveys, Ottawa: Statistics Canada, pp. 195-204.
Mateo-Sanz, J. M., and Domingo-Ferrer, J., (1999) A method for data-oriented multivariate microaggregation. In: Statistical Data Protection'98, Luxembourg: OPOCE, pp. 89-99.
Nechaeva, E., and Sokolov, A., (1996) Utilization of microaggregation methods for providing confidentiality for data on R+D institutions. In: Proceedings of the 3rd International Seminar on Statistical Confidentiality, Eurostat-Statistical Office of the Republic of Slovenia, pp. 218-225.

Milestones and expected results

- New microaggregation algorithms to improve on the current ones and on other masking techniques with regard to information loss and disclosure risk.
- Prototype software to implement the new algorithms.
The task is expected to yield microaggregation algorithms that optimise the trade-off between information loss and data safety. True multivariate microaggregation (without data projection) appears to be largely unexplored and especially promising. The prototype implementations produced will be ready for integration into µ-ARGUS. It is expected that results will be publishable in high-quality scientific journals.

Task T4 (responsible CISC)

Objectives

µ-ARGUS currently offers two SDC techniques for protecting categorical microdata: global recoding and local suppression. Neither technique uses any information about the categories (or the domain) corresponding to a given variable. If a variable is known to have as its domain a set of ordered categories (linguistic terms), microaggregation is a feasible alternative. A possible approach is to microaggregate by first translating the ordered categories into a numerical scale. This translation amounts to an implicit choice of semantics for the ordered categories. The aim of this task is to extend µ-ARGUS with mechanisms to make this semantics explicit and to provide the corresponding aggregation tools for categorical microdata.

Description of the work

Our approach to defining methods for qualitative aggregation is based on the two-stage procedure of qualitative aggregation: (i) semantics determination and (ii) aggregation function selection. In the semantics determination stage, we plan to develop a set of tools to describe the semantics of linguistic labels. This semantics will be represented as metadata of the tables. Three types of semantics will be considered: a) Explicit interval selection (Moore, 1966), b) Explicit fuzzy interval selection (Klir and Yuan, 1995), and c) Implicit selection from pairs of antonyms (Valls and Torra, 1999). Successful completion of this stage either requires the reader of µ-ARGUS to be modified (so that some information is included in the metafile), or an extra metafile to be included, or information to be introduced by the user in an ad-hoc menu. In the aggregation function selection stage, we plan to develop a set of microaggregation functions for qualitative values on the basis of the three models of semantics description mentioned above. We expect to disseminate the results obtained in this task in high-quality scientific publications.
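The two-stage procedure can be illustrated with a minimal Python sketch for the first type of semantics only (explicit intervals). The semantics table, the label names and the aggregation rule used here (average the interval midpoints, then return the label whose interval contains the average) are assumptions made for the example; they are one possible choice and not the aggregation functions that this task will develop.

```python
def aggregate_labels(labels, semantics):
    """Aggregate ordered linguistic labels using explicit interval semantics.

    semantics: dict mapping label -> (low, high), with intervals assumed
    to be ordered and non-overlapping except at shared end points.
    Stage (i), semantics determination, is the table itself; stage (ii),
    the aggregation function, averages the interval midpoints and maps
    the average back to the label whose interval contains it.
    """
    if not labels:
        raise ValueError("no labels to aggregate")

    def midpoint(label):
        low, high = semantics[label]
        return (low + high) / 2

    average = sum(midpoint(label) for label in labels) / len(labels)

    for label, (low, high) in semantics.items():
        if low <= average <= high:
            return label
    # Fallback if the intervals leave gaps: nearest midpoint.
    return min(semantics, key=lambda label: abs(midpoint(label) - average))


# Hypothetical semantics with intervals of unequal width.
semantics = {"low": (0, 2), "medium": (2, 6), "high": (6, 10)}
group_value = aggregate_labels(["high", "high", "low"], semantics)
```

In this example the group {high, high, low} is released as "medium". A different interval table could give a different result for the same labels, which is why the semantics needs to be stated explicitly, for instance as metadata, rather than left implicit in a numerical recoding.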

References

Klir, G., Yuan, B., (1995), Fuzzy Sets and Fuzzy Logic: Theory and Applications, Prentice-Hall, U.K.
Moore, R. E., (1966), Interval Analysis, Prentice-Hall, Englewood Cliffs, NJ.
Valls, A., Torra, V., (1999), On the semantics of qualitative attributes in knowledge elicitation, Int. J. of Intelligent Systems 14:2 195-20.

Milestones and expected result

- Development of aggregation functions based on three different semantics.
- Description of the aggregation functions.
- Prototype software to implement the new aggregation algorithms for SDC.
The task is expected to yield SDC methods for qualitative variables that make use of the semantics of the variable categories. Prototype implementations will be ready for integration into µ-ARGUS. It is expected that results will be publishable in high-quality scientific journals.